19 research outputs found

    Computational Protein Design: Validation and Possible Relevance as a Tool for Homology Searching and Fold Recognition

    BACKGROUND: Protein fold recognition usually relies on a statistical model of each fold; each model is constructed from an ensemble of natural sequences belonging to that fold. A complementary strategy may be to employ sequence ensembles produced by computational protein design. Designed sequences can be more diverse than natural sequences, possibly avoiding some limitations of experimental databases. METHODOLOGY/PRINCIPAL FINDINGS: We explore this strategy for four SCOP families: Small Kunitz-type inhibitors (SKIs), Interleukin-8 chemokines, PDZ domains, and large Caspase catalytic subunits, represented by 43 structures. An automated procedure is used to redesign the 43 proteins. We use the experimental backbones as fixed templates in the folded state and a molecular mechanics model to compute the interaction energies between sidechain and backbone groups. Calculations are done with the Proteins@Home volunteer computing platform. A heuristic algorithm is used to scan the sequence and conformational space, yielding 200,000-300,000 sequences per backbone template. The results confirm and generalize our earlier study of SH2 and SH3 domains. The designed sequences resemble moderately distant natural homologues of the initial templates; e.g., the SUPERFAMILY library of profile Hidden Markov Models recognizes 85% of the low-energy sequences as native-like. Conversely, Position Specific Scoring Matrices derived from the sequences can be used to detect natural homologues within the SwissProt database: 60% of known PDZ domains are detected and around 90% of known SKIs and chemokines. Energy components and inter-residue correlations are analyzed and ways to improve the method are discussed. CONCLUSIONS/SIGNIFICANCE: For some families, designed sequences can be a useful complement to experimental ones for homologue searching. However, improved tools are needed to extract more information from the designed profiles before the method can be of general use.
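As a rough illustration of how a Position Specific Scoring Matrix can be derived from a set of designed sequences, here is a minimal sketch. It assumes equal-length, ungapped sequences (as fixed-backbone design would produce), a uniform background frequency, and a flat pseudocount; the function names `build_pssm` and `score` are hypothetical. Real profile tools use sequence weighting and database-derived background frequencies, so this is not the procedure used in the study.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(sequences, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per position.

    Assumption: all sequences have the same length and no gaps.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must all have the same length")
    background = 1.0 / len(ALPHABET)  # uniform background (an assumption)
    total = len(sequences) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        counts = Counter(s[pos] for s in sequences)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        }
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum the per-position log-odds scores of a candidate segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

A candidate segment that matches the designed ensemble scores higher than an unrelated one, which is the basis for scanning a database such as SwissProt with the profile.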

    CAIDE dementia risk score relates to severity and progression of cerebral small vessel disease in healthy midlife adults: the PREVENT-Dementia study.

    BACKGROUND: Markers of cerebrovascular disease are common in dementia, and may be present before dementia onset. However, their clinical relevance in midlife adults at risk of future dementia remains unclear. We investigated whether the Cardiovascular Risk Factors, Ageing and Dementia (CAIDE) risk score was associated with markers of cerebral small vessel disease (SVD), and whether it predicted future progression of SVD. We also determined its relationship to systemic inflammation, which has additionally been implicated in dementia and SVD. METHODS: Cognitively healthy midlife participants were assessed at baseline (n=185) and 2-year follow-up (n=158). To assess SVD, we quantified white matter hyperintensities (WMH), enlarged perivascular spaces (EPVS), microbleeds and lacunes. We derived composite scores of SVD burden, and subtypes of hypertensive arteriopathy and cerebral amyloid angiopathy. Inflammation was quantified using serum C-reactive protein (CRP) and fibrinogen. RESULTS: At baseline, higher CAIDE scores were associated with all markers of SVD and inflammation. Longitudinally, CAIDE scores predicted greater total (p<0.001), periventricular (p<0.001) and deep (p=0.012) WMH progression, and increased CRP (p=0.017). Assessment of individual CAIDE components suggested that markers were driven by different risk factors (WMH/EPVS: age/hypertension, lacunes/deep microbleeds: hypertension/obesity). Interaction analyses demonstrated that higher CAIDE scores amplified the effect of age on SVD, and the effect of WMH on poorer memory. CONCLUSION: Higher CAIDE scores, indicating greater risk of dementia, predict future progression of both WMH and systemic inflammation.
Findings highlight the CAIDE score's potential as both a prognostic and predictive marker in the context of cerebrovascular disease, identifying at-risk individuals who might benefit most from managing modifiable risk. This work was funded by a grant for the PREVENT-Dementia programme from the UK Alzheimer’s Society (grant numbers 178 and 264), and the PREVENT-Dementia study is also supported by the US Alzheimer’s Association (grant number TriBEKa-17–519007) and philanthropic donations. AL is supported by the Lee Kuan Yew Fitzwilliam PhD Scholarship and the Tan Kah Kee Postgraduate Scholarship. JDS is a Wellcome clinical PhD fellow funded on grant 203914/Z/16/Z to the Universities of Manchester, Leeds, Newcastle and Sheffield. EM is supported by Alzheimer’s Society Junior Research Fellowship (RG 9611). LS is supported by the Cambridge NIHR Biomedical Research Centre (BRC) and Alzheimer’s Research UK (ARUK-SRF2017B-1). HSM is supported by an NIHR Senior Investigator award. JOB and HSM receive infrastructural support from the Cambridge NIHR Biomedical Research Centre (BRC). This research was supported by the NIHR Cambridge BRC (BRC-1215-20014). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care.

    Computational protein design for structure prediction

    Thanks to recent technological breakthroughs and the arrival of next-generation sequencers, the amount of genomic data is growing exponentially, while the gap with the number of solved structures is widening. Ideally, computational 3D structure prediction should be possible from sequence information alone, even in the absence of homology. Indeed, below 30% sequence identity, sequence similarity measures are no longer sufficient to detect homology. New methods are therefore needed to tackle this twilight zone. For a given structure (and hence a biological function), usually only a few native sequences are known, and they may share little identity. It is thus difficult to build a profile for finding homologues whose structure is unknown. How can we obtain larger sequence databases for each structure? Computational Protein Design (CPD) tries to address this question: if a fold is known, is it possible to predict all the sequences that match it? CPD consists of identifying, among all sequences compatible with the fold of interest, those that will confer the desired function on the protein. The procedure has two steps. The first computes an energy matrix holding the interaction energies between all pairs of residues of the protein, allowing in turn all amino acid types in all their possible conformations. The second, the "optimization step", explores sequence space and conformational space simultaneously to determine the best combination of amino acids for the starting fold. We first analysed covariances between alignment positions of the theoretical sequences, and developed a statistical method to locate sets of positions that mutate together for a given structure. Because a profile built from all the theoretical sequences averages the amino acid information too strongly, we improved homologue searching by building several profiles from groups of sequences classified by patterns at these covarying positions. To better assess the quality of these predicted sequences, we set up a protocol for selecting the best mutant proteins in order to test them in vivo. But how can we decide that one sequence is better than another? What are the relevant criteria? A set of descriptors was chosen to sort the theoretical sequences on several criteria, leaving about a dozen sequences. These mutant proteins were then subjected to molecular dynamics simulations to assess their theoretical stability. For the most promising ones, we carried out overexpression, purification and structure determination experiments to attempt a biological validation of the CPD model. These analysis and validation protocols appear to be good tools that will allow our team to test other mutant proteins in the future, so that CPD parameters can be adjusted on the basis of experimental results.
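The two-step procedure described in this abstract (precompute a pairwise energy matrix, then search sequence space for low-energy combinations) can be sketched as a toy example. Everything here is an assumption for illustration: the four-letter alphabet, the random numbers standing in for a physical energy function, the omission of side-chain conformations (rotamers), and the choice of Metropolis Monte Carlo as the search heuristic. The function names are hypothetical and this is not the thesis's actual implementation.

```python
import itertools
import math
import random

AMINO_ACIDS = "AVLK"  # toy alphabet (assumption); real CPD uses all 20 types


def build_energy_matrix(n_positions, seed=0):
    """Step 1: tabulate an energy for every pair of positions and types.

    Random values stand in for a molecular mechanics energy function.
    """
    rng = random.Random(seed)
    pair = {}
    for i, j in itertools.combinations(range(n_positions), 2):
        for a in AMINO_ACIDS:
            for b in AMINO_ACIDS:
                pair[(i, a, j, b)] = rng.uniform(-1.0, 1.0)
    return pair


def total_energy(seq, pair):
    """Sum the tabulated pair energies over all position pairs."""
    return sum(
        pair[(i, seq[i], j, seq[j])]
        for i, j in itertools.combinations(range(len(seq)), 2)
    )


def optimize(pair, n_positions, n_steps=2000, temperature=0.5, seed=1):
    """Step 2: Metropolis Monte Carlo search over sequences."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(n_positions)]
    energy = total_energy(seq, pair)
    best_seq, best_energy = list(seq), energy
    for _ in range(n_steps):
        pos = rng.randrange(n_positions)
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)  # propose a point mutation
        new_energy = total_energy(seq, pair)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy  # accept the move
            if energy < best_energy:
                best_seq, best_energy = list(seq), energy
        else:
            seq[pos] = old  # reject the move and restore the old type
    return "".join(best_seq), best_energy
```

Because the energy table is computed once, each optimization step only needs table lookups, which is what makes scanning many sequences affordable.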

    Correlation analysis for the PDZ domain 1QAU.

    A) Covariance matrix; the amino acid sequence runs along the top and the side of the plot, with secondary structure elements indicated as arrows (strands) or rectangles (helices). Bright points in the matrix correspond to highly correlated amino acid pairs. Red dots along the top and side label the network shown in B). B) 3D structure with secondary structure elements labelled as in A). A correlated network of five amino acids is shown (yellow spheres, labelled with amino acid number; red dots in A). C) The most frequent sequence patterns for the five amino acids, with their frequency within the 10,000 low energy sequences. D) Number of homologues retrieved by BLAST searching using subsets of sequences that obey one of the frequent patterns (E-value threshold of 1). Homologues retrieved using all the low energy sequences are shown by the rightmost bar (labelled ‘None’). Thick lines represent true homologues; thin lines show false positives.

    Individual components of the folding free energy, on a per-residue basis.

    Results are for six protein templates (the six bars that appear for each energy term). From left to right: 1CKA and 1CSK (SH3 domains); 1NRV and 1SHD (SH2 domains); 1QAU and 2FE5 (PDZ domains). Dark bars correspond to the 8,000 lowest-energy designed sequences; light bars correspond to native sequences with optimized rotamers. Mean values (kcal/mol) are given above or below each set of columns. The designed and native sequences use opposite sign conventions, for clarity (as if we plotted the negative designed energies).

    The four SCOP families studied here.

    From left to right: Small Kunitz-type Inhibitors (SKIs), Chemokines, PDZ domains, and Caspases, represented by a single 3D structure (above) or an alignment of five family members (below).

    Swissprot sequences retrieved using natural, designed, and random PSSMs.

    Number of false positives in parentheses. (a) The sequences used to construct the PSSM are either natural sequences from the NR01 database, low-energy designed sequences, or random sequences. (b) The designed sequences with the highest CDD scores (Chemokines) or with five SBPs reset to their native types (PDZ domains).

    Histograms of the folding free energy.

    Results are shown for designed, native and HMM sequences, for two SH3 domains (1CKA, 1CSK), two SH2 domains (1SHD, 1NRV), and two PDZ domains (1QAU, 2FE5). Black: HMM; grey: designed; dashed grey: native; dashed black: HMM sequences after restrained optimization (using 9 amino acid groups). Each panel shows data for two proteins, with opposite vertical axes.

    Mean identity score vs. the folding free energy (top) and its components (middle, bottom), for seven proteins.

    Results are for the 8,000 lowest-energy designed sequences, which are compared to their corresponding native template. The size of each symbol indicates the number of sequences with the corresponding energy (energies binned in 10 kcal/mol windows). Negative energies indicate stable folding of the designed sequences.